TMExplorer: A tumour microenvironment single-cell RNAseq database and search tool

Motivation The tumour microenvironment (TME) contains various cells including stromal fibroblasts, immune and malignant cells, and its composition can be elucidated using single-cell RNA sequencing (scRNA-seq). scRNA-seq datasets from several cancer types are available, yet we lack a comprehensive database to collect and present related TME data in an easily accessible format. Results We therefore built a TME scRNA-seq database, and created the R package TMExplorer to facilitate investigation of the TME. TMExplorer provides an interface to easily access all available datasets and their metadata. The users can search for datasets using a thorough range of characteristics. The TMExplorer allows for examination of the TME using scRNA-seq in a way that is streamlined and allows for easy integration into already existing scRNA-seq analysis pipelines.


Introduction
Single-cell RNA sequencing (scRNA-seq) is a new technology that has emerged as an important tool to measure gene expression for individual cells, enabling the examination of cellular heterogeneity and tissue composition with incredible precision. This has been particularly applicable in cancer research for the study of tumour composition, heterogeneity and phenotype, all of which are directly impacted by the tumour-microenivronment (TME). TMEs are composed of different stromal and cancer cell types whose interactions likely dictate different aspects of tumour behaviour, such as metastasis [1][2][3][4]. Combined with scRNA-seq analysis methods, scRNA-seq enables us to dissect the TME into individual cells and investigate the different cell subpopulations that exist. Such investigations into the TME are becoming increasingly important, as tumour composition and heterogeneity can influence cancer progression and the outcome of cancer therapy [1,[4][5][6][7][8][9].
With the advancement of scRNA-seq in cancer research, the number of TME datasets that are generated continues to increase, yet they can be difficult to access. Raw sequence reads generated by scRNAseq can be shared through online archives, such as the Sequence Read Archive (SRA) [10], however they exist as large files that require further processing to be analyzed, making data access a challenge. Already processed scRNA-seq data containing gene expression information can be accessed through online archives, such as the Genome Expression Omnibus (GEO) [11], and can be more easily downloaded for use in one's own analysis. Furthermore, to manage the growing abundance of publicly available scRNA-seq data, proper quality control and curation of datasets must be done [12,13]. Currently, several online databases offer curated collections of public scRNA-seq datasets, such as PanglaoDB [12], scRNASeqDB [13], JingleBells [14] and the Single Cell Portal created by the Broad Institute of MIT and Harvard [15]. Most existing scRNA-seq databases include a mixture of samples from normal tissues and tissues affected by cancer or other diseases [12][13][14], while others focus primarily on samples from normal tissues [16,17]. A recently published toolkit called CReSCENT [18] contains only cancer scRNA-seq data, however it mainly acts as a cancer data analysis pipeline rather than a database. A comprehensive database for the collection and sharing of TME scRNA-seq datasets from a range of tumour types does not yet exist, and researchers interested in using publicly available TME data must search through several databases to collect relevant datasets for their study. A database of TME scRNA-seq samples will thus streamline the data collection steps required for researching cancer at a single-cell level, lowering the barrier for entry to this type of study.
It is likewise important that scRNA-seq databases are designed to facilitate streamlined data collection and analysis. This can include a search tool that allows users to select datasets based on desired characteristics. While existing databases include search tools, they provide few options in characteristics users are able to search for and often require users to browse through a metadata table prior to selecting datasets of interest. Furthermore, they are designed as webbased tools, and thus are not intended to be integrated into workflows [12][13][14][15]. Workflow integration would enable users to access data directly in their pipelines, thus automating the data collection process and increasing analysis efficiency. A scRNA-seq database that is provided as an R-package and contains a comprehensive search tool which allows users to select datasets based on a wider variety of characteristics would make the data collection process easier for researchers.
Here, we present a curated collection of tumour scRNA-seq datasets made available as an R-package called TMExplorer. TMExplorer contains publicly available scRNA-seq datasets specific to TMEs from various tumour types collected from different scRNA-seq studies [1-3, 5-9, 19-53] and online databases [11,54]. In addition to gene expression data, TMExplorer contains the corresponding cell type annotations and gene-signature information for several datasets, and provides a search tool that enables users to search for multiple datasets according to 13 different characteristics (Table 1). When selecting datasets, users can review the metadata table first or they can retrieve datasets that match specific criteria without having to browse through the metadata table. While online databases require users to download a given dataset prior to use, TMExplorer allows users to access and search available datasets within R. Users can thus input the data directly into existing pipelines with only a few commands. Each dataset can be used directly within R as a SingleCellExperiment object, or exported as a gene expression matrix in multiple formats for use with other applications. Users interested in validating scRNA-seq analysis algorithms, as they apply to TME data, can easily access this information through TMExplorer and incorporate it into their pipelines. Altogether, TMExplorer makes it easier for researchers to access and share TME scRNA-seq datasets, facilitating the study of TMEs at the single-cell level in the field of cancer research.

Data collection
In order to collect the datasets, we searched the National Center for Biotechnology Information (NCBI) [55] for relevant scRNA-seq studies using the following keywords: single cell RNA sequencing, tumour, cancer, tumour microenvironment, and malignant. We then carefully reviewed the published literature and any associated data to confirm if they matched our criteria. Datasets were included in our data collection if they were publicly available as processed data, were generated by scRNA-seq and if they consisted of TME expression data. A total of 48 datasets originating from different types of human and mouse tumours were collected from online sources such as the NCBI's Gene Expression Omnibus (GEO) [11], ArrayExpress [54], and Github [56]. Out of the 48 datasets we collected, 44 datasets originated from human tumours and 4 datasets originated from mouse tumours (Fig 1). Descriptions of the collected datasets are provided in Table 2. Metadata for each dataset, such as tumour type and number of cells sequenced, were collected from descriptions in the corresponding publications and/or from the online sources that the datasets were obtained from. If publicly available, we also retrieved cell-type annotations and/or gene signature information that accompanied the datasets. All data is hosted on FigShare [57] under the TMExplorer project.

Data curation
Datasets found on GEO often contain extra information such as Ensemble ID or chromosome region in additional rows or columns. We modified all datasets using R to ensure they followed a similar genes-by-cells format with the gene column serving as an index. If any dataset is published as separate samples, samples are merged into a single file with a suffix identifying the sample appended to cell IDs, so users may separate the samples and perform batch correction if necessary. Having a similar format for datasets reduces the preprocessing required to use this data in other analysis pipelines. There are three main components to each dataset in our database: (1) gene-by-cell expression matrices; (2) cell type labels; and (3) gene signatures. The cell type labels are R dataframes with two columns; one contains every cell barcode present in the expression matrix, and the other one contains that cell's type. The gene signatures are stored in R dataframes containing one column per cell type, with a list of genes that are differentially expressed by that cell type and reported in the original paper in which each dataset was first introduced. All data for each dataset is accessible within a single object in order to make it as easy to use as possible.
Since R BioConductor has existing infrastructure for working with scRNA-seq data [58,59], we used it as the platform to build our package upon. In order to maintain compatibility with existing Bioconductor software, we return all datasets as SingleCellExperiment objects [59]. Fig 2 shows the structure of a SingleCellExperiment object, where the expression data is stored as a named assay, cell type labels (if present) are stored under colData(), and all other information is stored in a metadata list.
• Expression Data: Named assays allow certain formats to be easily accessed with getter functions such as counts() and tpm(), while other formats can still be accessed with the assay() getter function [59]. All SingleCellExperiments have one assay named according to the type of score (e.g. Counts and TPM) represented in that object. Calling assay() returns an expression matrix with rows of genes and columns of cells.
• Cell Type Labels: ColData stores metadata for the columns in the assay matrix. In our case this refers to the cell type annotations, if they are available. ColData is a dataframe that   always has one row for every column in the assay matrix, ensuring that there is a label present for every cell. If the cell type is not available for a given cell, it is labelled as "unknown".
• Metadata: The metadata list serves to store any other information that does not fit into a preexisting attribute of the SingleCellExperiment object, and is accessed with the metadata() function. This named list contains the signature gene sets, available score types, tumour and host organism type, sequencing technology, author, and all other descriptive information as strings. All information that is available in the metadata table can be accessed by calling the query function of TMExplorer (i.e. queryTME) with the metadata_only parameter set to true.

Metadata
After collecting the datasets, corresponding metadata was compiled into a table which serves as the core of the package (S1 Fig). The metadata table contains information such as GEO accession, author, journal, year, PMID, sequencing technology, expression score type(s), source organism, type of cancer, number of patients, tumours, cells and genes, and the database that the data was obtained from (S1 Table). All items in the metadata table were chosen as either entities that distinguish one dataset from the others or criteria that may make a dataset or group of datasets interesting to researchers (e.g. a specific tumour type or availability of cell labels or gene signatures). Users can view the available data using the metadata table and decide which dataset best fits their needs.

Database query
TMExplorer provides a query function (i.e. queryTME) that users can employ in order to select multiple datasets based on their desired characteristics (S2 Fig). For example, users can select

PLOS ONE
TMExplorer: A tumour microenvironment single-cell RNAseq database and search tool specific studies by PMID or GEO accession, or filter subsets by sequencing technology, whether cell type labels or cell type signature gene sets are available etc. Sequencing technology, score type, organism, tumour type, and year were all chosen as search parameters because they represent differences in the type of data and make it easier to find data that fits the needs of different studies. Some datasets may publish multiple tumour types under the same study.
TMExplorer is able to handle this by having multiple rows of different datasets from the same study. In these cases, users will need to provide multiple search parameters to select a single row, for instance the GEO accession and tumour type for a study that contains multiple cancers. We have also made it possible to search for datasets for which the cell labels and gene signatures are available. This facilitates developing and testing algorithms that require specific types of dataset information. For example, testing cell classification algorithms requires cell labels that can be used as a gold standard, and many existing algorithms require gene signatures that represent the cell types in the dataset [60,61].

Alternative experiments
For several datasets, gene expressions are available in multiple score types including raw counts and normalized data by FPKM, TPM or CPM. In order to store each dataset in multiple score types, we used nested SingleCellExperiments objects with the alternative experiments

PLOS ONE
(altExps) concept. Alternative experiments are guaranteed to have the same dimensions as the primary object, but can be kept separate for use in other pipelines [59]. This allows users to download multiple types of scoring for use in different steps of analysis while still being able to access each dataset through a single object. Being able to download multiple score types allows our datasets to be used in a variety of algorithms that require a specific type of score, and keeping them separated as nested objects prevents accidentally applying an algorithm to the wrong score type.

Dense vs. sparse data formats
In order to reduce the memory requirements for working with large datasets, expression data is optionally available as a sparse matrix. We implemented sparse matrices using the dgCMatrix class from R Matrix [62]. This reduces memory usage by only storing non-zero expression values. With sparse matrices, the memory required to store a dataset is reduced by as much as 8 Gb for a dataset with 51,775 cells and 22,533 genes. It should be noted that not all software packages are compatible with sparse matrices, and converting large datasets from sparse to dense may crash R on machines with low memory. Thus users should confirm that their algorithms support sparse matrices before using them. By default, TMExplorer returns dense matrices to avoid these problems.

Exporting data in multiple formats
Several tools for scRNA-seq analysis are written in R and therefore a SingleCellExperiment object can easily be incorporated into these pipelines and tools. However, many other analysis tools are written in Python or as webapps [18,63,64]. To facilitate the use of TMExplorer with these tools, we wrote a function saveTME that writes individual TME datasets to disk as CSV or Matrix Market files, depending on whether data was loaded as dense or sparse matrices by queryTME, respectively. SaveTME takes a SingleCellExperiment object and a path to an output directory as parameters and saves the gene expression matrix, cell type labels, and cell type signature gene sets to disk. The resulting files can then be converted as needed and used in other applications.

Adding new datasets
We keep TMExplorer updated with new datasets as they get published. Additionally, users of the package doing their own novel research will have access to an issue template on Github where they can submit their data for inclusion. The interested users will need to provide their scRNAseq data as raw counts or normalized data and the corresponding metadata. Since TMExplorer is open source, those same users can create a fork of the repository and build it from source with their own data for pre-publication work. Users wishing to fork the repository for their own data need only replace or add to the metadata table used by the package and update any documentation or function names to reflect the new data. If users are adding new TME data, no functions need to be changed since this package already uses TME data. Those users who are interested in adopting the package for other types of single-cell sequencing data (such as sc-ATAC seq) can do so by changing documentation and functions to reflect the new data type.

Overview of the TMExplorer package
To make it as easy as possible to integrate TMExplorer into other pipelines, all interactions with the package are done directly in R. Here, the queryTME function serves as the primary interface for the package, allowing users to view the metadata for all available datasets, or select a subset of datasets according to descriptive criteria (Fig 3A). queryTME provides a set of parameters (Table 1) used to select a subset of datasets according to characteristics. To review the available datasets, the metadata_only parameter should be set to TRUE when querying the package, and a table describing the datasets will be returned instead of the datasets themselves. The search parameters can be used to find relevant data without requiring users to review the metadata table first, lowering the barrier for use. For example, users looking for a certain type of cancer, such as melanoma, can search using queryTME(tumour_type ="Melanoma") without needing to first examine the metadata for datasets containing melanoma cancers. After querying the database, a list of SingleCellExperiment objects is returned. The objects in this list can then be passed to any other algorithms that accept a SingleCellExperiment object, sparse dgCMatrix, or dense gene expression matrix for inclusion in a pipeline (Fig 4). Alternatively, the saveTME function can be used to write the returned data to disk for further manipulation or use in applications outside of R (Fig 3B). Fig 4 shows how saveTME can be used to save data for analysis in Python. In order to maintain consistency, the returned value is always a list of results, whether or not multiple datasets match the query.

PLOS ONE
TMExplorer: A tumour microenvironment single-cell RNAseq database and search tool [46] (Fig 5E). 13 out of 28 cancer types have more than one associated dataset (Fig 5E). Also, 6 out of 48 datasets are from rare cancers (incidence rate of < 6 in a million persons), including merkel cell carcinoma [53], thymic carcinoma [52], ewing sarcoma [50], astrocytoma [21], oligodendroglioma [8], and ependymoma [46]. Numbers of cells and genes vary across datasets and fall within the range of 4,343-57,915 genes and 74-208,506 cells (Fig 5F and S1 Table). The datasets are sequenced by different sequencing technologies including 10x Genomics, SMART-seq2 and Fluidigm C1 (Fig 5B). Each dataset is provided as processed gene expression data, and are provided either as raw counts or normalized data (e.g. TPM, RPKM, and RPM) (Fig 5D). We did not include raw scRNA-seq data (i.e. FASTQ files) in our collection because these files tend to be very large and can be accessed through the SRA, if available. Out of the 48 datasets, cell-type annotations are also provided for 16 datasets and gene signature information is provided for 18 datasets (Fig 5C), so that users may access and use this information in their analyses. Also, for 10 datasets both cell type annotations and gene signatures are available ( Fig  5C). Users can browse through the available datasets using the metadata table and then choose which dataset(s) they would like to analyze. Users can also save the datasets for use outside of R, for instance in Python or web-based analysis pipelines.

TMExplorer search capability
An important feature of TMExplorer is that it acts as both a database and search tool that can be easily implemented in one's own workflow. Some other currently available scRNA-seq databases have a search function, but cannot be easily integrated into workflows because they are web-based [12][13][14][15]. Currently available R-based scRNA-seq databases lack built in search tools, requiring users to access vignettes to see the available data before it can be retrieved for An example workflow of using TMExplorer to obtain datasets for the downstream analysis using Python and R. Users start by using queryTME to return all datasets that have cell type labels and cell type signature gene sets, which will get a list of matching datasets contained in SingleCellExperiment objects. Then, for R based algorithms, users can pass the SingleCellExperiments directly if that is supported, or users can pass the individual components required. For Python based algorithms, saveTME can be used to save the files for each dataset to disk, which can then be opened in Python for analysis. https://doi.org/10.1371/journal.pone.0272302.g004

PLOS ONE
TMExplorer: A tumour microenvironment single-cell RNAseq database and search tool use in a pipeline [16]. TMExplorer provides a search tool that allows users to search for datasets that fit their needs by tumour type, sequencing technology, source organism, and more (Table 1), all from the R command line. This makes TMExplorer an improvement over both R-based and web-based databases because users are able to browse and query data from the same console they are using for analysis. By including a search tool and database in a single package, TMExplorer provides a single point of entry to include TME scRNA-seq data into data analysis pipelines. In Fig 6, we provide a flowchart that shows various steps involved in querying TMExplorer, obtaining the datasets of interest, saving them on the local machines, and performing further analysis in R or other programs.

Case studies
In this section, we bring two example applications where TMExplorer can be used to facilitate data analysis. In the case study 1, we show how TMExplorer can be combined with automated cell-type identification algorithms to identify different cell types in TME scRNA-seq data. Here, we also show how users can return datasets with both the signature gene sets and gold standard annotations needed for testing cell-type identification. In case study 2, we show how TMExplorer can be integrated with the algorithms for inferencing copy-number variations in individual cells and facilitate the separation of malignant and non-malignant cells in multiple tumour scRNA-seq datasets of the same cancer type.

PLOS ONE
Case study 1: Identifying different cell types in TME scRNA-seq data. Often, when using TME scRNA-seq data, we are interested in the cellular composition of the dataset. In order to find this, automated cell type identification algorithms are used. This is usually done by first clustering the cells, and then assigning appropriate cell type labels to each cluster [65]. In Fig 7, we show how TMExplorer can be combined with a clustering method (e.g. Seurat [66]) and a cluster labelling method (e.g. GSVA [61]) to create a workflow for the identification of cell populations within a dataset. Seurat requires only the gene expression matrix to perform clustering, but GSVA requires a list of cell-type signature gene sets in addition to the expression matrix. TMExplorer can return all of the datasets that have signature gene sets available using queryTME(has_signatures = TRUE). If after identifying the cell types within a dataset, users want to assess the performance of their workflow by comparing the automated annotations to those reported alongside the dataset, the has_truth = TRUE parameter can be added to queryTME to only return datasets that have gold standard labels available. Seurat and GSVA can be replaced by any other tool that accepts a SingleCellExperiment object or a matrix of gene expression values, providing flexibility for users to incorporate TMExplorer into their own workflows.

PLOS ONE
Case study 2: Inferencing copy number variations in multiple datasets of the same cancer type. Single cell sequencing is an important tool that enables the dissection of TMEs into malignant and non-malignant cells. Researchers interested in comparing the tumour composition across different datasets of a specific cancer type would have to collect datasets from different sources prior to application of separation methods. With TMExplorer, users can easily access multiple datasets of a specific tumour type, as well as the accompanying cell type annotations and/or gene signature information from one location, thus avoiding inconsistencies when acquiring data from different databases. TMExplorer can be easily incorporated with other packages into workflows for the analysis of scRNA-seq data, therefore enabling users to access and use the data entirely within R. Fig 8 displays an example workflow that uses queryTME(tumour_type = "Glioblastoma") to retrieve datasets of a specific cancer type (i.e. glioblastoma) for use in the downstream analysis. In this example, we retrieved glioblastoma datasets from the TMExplorer database as Single-CellExperiment objects and converted them to gene expression count data matrices. We then applied a copy number variation (CNV) inferencing method called CONICSmat [67] to each of the datasets individually, and generated heatmaps displaying the inferred CNV patterns. This allowed us to separate malignant and non-malignant cells considering their long-range CNV patterns. The proportion of malignant and non-malignant cells and the patterns of CNV across the different datasets can then be compared. A case study on using TMExplorer to identify cell types. A case study showing how TMExplorer can be used in order to obtain datasets for cell cluster labelling via Seurat and GSVA. queryTME can be used to return those datasets which have both gene signatures and cell type annotations required for testing the automated identification of cell types. The expression data can be passed to Seurat for cell clustering, and the gene signatures can be used by GSVA to identify the cell types in Seurat's clusters. Finally, the cell type annotations can be used as the truth labels to measure the performance of the results obtained by Seurat clustering followed by GSVA. https://doi.org/10.1371/journal.pone.0272302.g007

Discussion
The emergence of single-cell RNA sequencing has enabled the study of tumour composition and phenotype. With the increasing use of scRNA-seq in cancer research, scRNA-seq data from TMEs continues to be generated and published. In order to streamline the data collection process A case study on using TMExplorer for inferring CNVs. A case study showing how TMExplorer can be used to obtain multiple datasets for a specific tumour type, to be used with CNV-based separation methods, such as CONICSmat. QueryTME returns datasets of a specific tumour type, such as Glioblastoma. These datasets can then be inputted directly into large-scale CNV inferencing methods, such as CONICSmat. https://doi.org/10.1371/journal.pone.0272302.g008

PLOS ONE
for researchers interested in studying the TME, we created a curated database of TME scRNA-seq datasets, made available as an R-package called TMExplorer. Here we have built a database using a variety of cancers from multiple sources. We searched NCBI [11,55] for TME scRNA-seq datasets that contain gene expression data, as well as comprehensive metadata such as tumour type, sequencing technology, cell type annotations, and gene signatures. In total, 48 datasets representing 26 different human cancer types and 4 different mouse cancer types are represented, along with their cell type annotations and cell type signature gene sets if they were available.
TMExplorer addresses a gap in currently available scRNA-seq databases by providing a focused, easily accessible database as an R package. TMExplorer has several advantages over other currently available scRNA-seq databases, the most prominent being: 1. Existing curated scRNA-seq databases consist of mostly normal tissue or non-cancer data and relatively few cancer datasets. To allow researchers to easily locate and access TME scRNA-seq data, we curated publicly available TME datasets and made them available in a database accessible as an R package. With TMExplorer, researchers can access all publicly available TME scRNA-seq datasets from a single location and can also return multiple datasets that match their desired criteria with a single command.
2. TMExplorer provides a variety of search parameters ( Table 1) that can be used to return a subset of the available data that matches specific criteria. The parameters were designed so that users can search for matching datasets without having to first view a list of all available datasets, making it easier and faster to access data of interest.
3. The majority of existing scRNA-seq databases can only be accessed online as web-based tools and are not easily incorporated into pipelines for analysis of scRNA-seq data. Since many researchers use R or Python for their analyses, we chose to provide TMExplorer as an R-package so that it may be easily integrated into existing pipelines.
4. Some analyses require more than just gene expression information, and TMExplorer provides cell type annotations and cell type signature gene sets alongside gene expression matrices, where they are available. This facilitates the use of a wider range of analysis methods without requiring additional work from the researchers.
We regularly maintain TMExplorer and add new datasets to our database as they get published. Additionally, we have provided an issue template and vignette on GitHub showing how users can process their data and submit it for inclusion in the package. Users who have found new published datasets or sequenced their own should read the formatting instructions and open a new issue using our template. The users who want their dataset to be included in TMExplorer need to provide a description of the dataset, a link to the source for the dataset, a link to the dataset files that will be added to the package, and the completed metadata table.
TMExplorer is generalisable to many other sources, including both single-cell and bulk sequencing data. We have recently worked on adopting it for the scATAC-seq data in scA-TAC.Explorer BioConductor package [68].
In summary, TMExplorer allows researchers to easily access, share and integrate TME scRNA-seq data into their own analysis pipelines. TMExplorer can be used to access data needed for the validation of new algorithms and to allow researchers interested in the tumour microenvironment to study specific types of cancer. A set of searchable parameters can be used to filter scRNA-seq datasets. The users can search for specific datasets using user-specified parameters, and return one specific dataset as a SingleCellExperiment object for downstream analysis. (TIF) S1 Table. Metadata of TMExplorer. The metadata table contains information such as GEO accession, author, journal, year, PMID, sequencing technology, expression score type(s), source organism, type of cancer, number of patients, tumours, cells and genes, and the database that the data was obtained from. (XLSX)